Task 1: Problem definition¶
Objectives for the Data Pipeline:¶
Objective 1: Identify Location and Room-Type with Highest Investment Return:
This involves analyzing historical resale price data to determine which locations and types of rooms (e.g., 4-room, 5-room, etc.) have shown the highest return on investment over the past years. It aims to provide insights for potential property investors or homeowners looking to maximize their returns.
Objective 2: Predict Price Growth for Future Using Machine Learning:
This objective involves building predictive models using machine learning algorithms on historical resale price data to forecast future price growth. By leveraging past trends and patterns, the goal is to provide stakeholders with insights into potential future price movements, aiding in decision-making processes related to property investment.
Measurement of Success:¶
The success of the data pipeline can be measured based on the following criteria:
- Accuracy of Predictions: For Objective 1, success can be measured by the accuracy of identifying locations and room types with the highest investment return based on historical data analysis. For Objective 2, success can be measured by the accuracy of the machine learning models in predicting future price growth.
- Return on Investment (ROI): Success can also be evaluated based on the ROI achieved by stakeholders who utilize the insights provided by the data pipeline. Higher ROI indicates the effectiveness of the insights generated.
- User Feedback: Soliciting feedback from stakeholders who utilize the insights generated by the data pipeline can also be a valuable measure of success. Positive feedback indicating that the insights are actionable and useful would signify success.
Use Cases for the Data Pipeline:¶
The data pipeline will serve various use cases including:
- Reporting: Generating reports on historical resale price trends, identifying locations and room types with the highest investment return, and providing summaries of machine learning predictions for future price growth.
- Analytics: Performing in-depth analysis of historical resale price data to uncover trends, patterns, and correlations that can inform decision-making related to property investment.
- Machine Learning: Developing and deploying machine learning models to predict future price growth based on historical data, enabling stakeholders to make informed decisions about property investment strategies.
Task 2: Data collection/curation¶
import re
import datetime
import warnings
import pandas as pd
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from matplotlib import rcParams
import matplotlib.pyplot as plt
from pathlib import Path
warnings.filterwarnings('ignore')
# sns.set_theme()
# figure size in inches
rcParams['figure.figsize'] = 11.7,5.27
The primary dataset required for this analysis is the HDB resale price data, which contains information about the resale prices of HDB flats in different locations and with various room types over past years. This dataset can be downloaded from data.gov.sg, specifically from these links:
- Resale flat prices based on registration date from Jan-2017 onwards
- Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016
- Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014
The three CSV files are concatenated to form the starting dataset.
To work towards the objectives defined in Task 1, we intend to proceed with the following steps:
Data Cleaning and Preparation: Preprocess the data to handle missing values, outliers, and inconsistencies. Ensure proper formatting of variables such as location, room type, and price.
Exploratory Data Analysis (EDA):
- Conduct EDA on the integrated dataset to understand the distributions, correlations, and patterns within the data.
- Identify potential features that correlate with investment return and price growth, such as location, room type, economic indicators, and demographic factors.
Model Development:
- Identify Features: Select relevant features for modeling based on insights from EDA and domain knowledge.
- Machine Learning Algorithms: Choose appropriate machine learning algorithms (e.g., regression, ensemble methods) for predicting price growth.
- Model Training: Train the selected models on historical data, utilizing techniques such as cross-validation to ensure robustness.
Model Evaluation:
- Performance Metrics: Evaluate model performance using metrics such as accuracy, precision, F1 score, and RMSE (Root Mean Squared Error) for regression tasks.
- Validation: Validate models on holdout datasets or through time-series cross-validation to assess generalization performance.
Interpretation and Insights:
- Interpret model results to understand the factors driving investment return and price growth.
- Extract actionable insights for stakeholders, such as identifying high-return locations and room types or predicting future price trends.
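The modeling and evaluation steps above can be sketched end-to-end. This is a minimal illustration on synthetic data; the choice of `RandomForestRegressor` and the two features are assumptions for demonstration, not the final design:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned resale dataset
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'floor_area_sqm': rng.uniform(31, 280, n),
    'lease_left': rng.uniform(40, 99, n),
})
# Hypothetical price relation: larger area and longer lease raise the price
demo['resale_price_thousand'] = (4 * demo['floor_area_sqm']
                                 + 2 * demo['lease_left']
                                 + rng.normal(0, 20, n))

X = demo[['floor_area_sqm', 'lease_left']]
y = demo['resale_price_thousand']
model = RandomForestRegressor(n_estimators=50, random_state=0)
# 5-fold cross-validation scored by RMSE, as described under Model Evaluation
scores = cross_val_score(model, X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
print(f"Mean CV RMSE: {-scores.mean():.1f} thousand $")
```

The same skeleton applies to the real data once features are selected from the EDA.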
Task 3: Data preparation¶
Initial Loading of data¶
To load the data, we first create a folder to hold the curated input data:
if not Path("./input").exists():
    Path('./input').mkdir()
The 3 files from 2012 to present day (2024) are downloaded from:
- Resale flat prices based on registration date from Jan-2017 onwards
- Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016
- Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014
and placed into the relative directory ./input as CSVs.
These are provided in the accompanying folder (downloaded 19th March); however, please redownload from the aforementioned sources if you need to rebuild the dataset on the latest data.
The following reads the files into memory and shows the types of data available:
files = Path('./input').glob('*.csv')
all_files = []
for file in files:
    df = pd.read_csv(file)
    print(f"Rows: {len(df)}, file: {file.name}")
    all_files.append(df)
dfs = pd.concat(all_files)
dfs = dfs.sort_values('month')
Rows: 37153, file: ResaleFlatPricesBasedonRegistrationDateFromJan2015toDec2016.csv
Rows: 175110, file: ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv
Rows: 52203, file: ResaleFlatPricesBasedonRegistrationDateFromMar2012toDec2014.csv
#plotting the heatmap
corr_matrix = dfs.corr(numeric_only=True)
plt.figure(figsize = (8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
# Taking a sample to show what the data looks like
dfs.sample(n=10)
| | month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | remaining_lease | resale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3509 | 2015-03 | TAMPINES | 4 ROOM | 730 | TAMPINES ST 71 | 10 TO 12 | 101.0 | Model A | 1997 | 81 | 454000.0 |
| 18071 | 2017-11 | PUNGGOL | 4 ROOM | 268B | PUNGGOL FIELD | 04 TO 06 | 93.0 | Premium Apartment | 2013 | 94 years 02 months | 413000.0 |
| 108888 | 2021-09 | SENGKANG | 5 ROOM | 203A | COMPASSVALE RD | 07 TO 09 | 110.0 | Improved | 2000 | 78 years 04 months | 508000.0 |
| 20711 | 2016-03 | CHOA CHU KANG | 4 ROOM | 685C | CHOA CHU KANG CRES | 10 TO 12 | 85.0 | Model A2 | 2002 | 85 | 330000.0 |
| 9826 | 2015-07 | SERANGOON | 5 ROOM | 518 | SERANGOON NTH AVE 4 | 04 TO 06 | 122.0 | Improved | 1992 | 76 | 515000.0 |
| 83601 | 2020-11 | HOUGANG | 4 ROOM | 580 | HOUGANG AVE 4 | 04 TO 06 | 100.0 | Model A | 1997 | 76 years 01 month | 416000.0 |
| 25658 | 2016-06 | BEDOK | 2 ROOM | 101 | BEDOK NTH AVE 4 | 04 TO 06 | 45.0 | Improved | 1978 | 61 | 235000.0 |
| 27666 | 2016-07 | BUKIT MERAH | 2 ROOM | 2 | JLN BT HO SWEE | 04 TO 06 | 34.0 | Standard | 1971 | 54 | 215000.0 |
| 130453 | 2022-07 | CLEMENTI | 3 ROOM | 711 | CLEMENTI WEST ST 2 | 07 TO 09 | 67.0 | New Generation | 1980 | 57 years 05 months | 345000.0 |
| 32294 | 2016-09 | TOA PAYOH | 4 ROOM | 213 | LOR 8 TOA PAYOH | 01 TO 03 | 82.0 | Improved | 1975 | 57 | 340000.0 |
dfs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 264466 entries, 1303 to 173931
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   month                264466 non-null  object
 1   town                 264466 non-null  object
 2   flat_type            264466 non-null  object
 3   block                264466 non-null  object
 4   street_name          264466 non-null  object
 5   storey_range         264466 non-null  object
 6   floor_area_sqm       264466 non-null  float64
 7   flat_model           264466 non-null  object
 8   lease_commence_date  264466 non-null  int64
 9   remaining_lease      212263 non-null  object
 10  resale_price         264466 non-null  float64
dtypes: float64(2), int64(1), object(8)
memory usage: 24.2+ MB
Data cleaning and handling null values¶
From the above information, we can see that remaining_lease is stored as a string of the form "## years ## months". Since it is essentially continuous data, we need to convert it into a single sortable numeric value, which we can express as year + month/12.
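For example, the proposed representation maps "94 years 02 months" to 94 + 2/12 ≈ 94.17. A minimal sketch of the arithmetic (the helper name `lease_str_to_years` is ours, separate from the full cleaning function used later):

```python
def lease_str_to_years(s: str) -> float:
    # "94 years 02 months" -> 94 + 2/12; a bare "81" -> 81.0
    parts = s.split()
    years = int(parts[0])
    months = int(parts[2]) if len(parts) >= 4 else 0
    return years + months / 12

print(round(lease_str_to_years("94 years 02 months"), 3))  # 94.167
print(lease_str_to_years("81"))                            # 81.0
```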
For the NaN values (Not a Number, akin to None or Null in this case), we chose to impute them as the current year (2024) minus lease_commence_date:
# The count under True below is the number of NA values in the dataset
dfs['remaining_lease'].apply(lambda x: pd.isna(x)).value_counts()
remaining_lease
False    212263
True      52203
Name: count, dtype: int64
# We can fill these in using 2024 (the current year) - lease_commence_date, reassigning the result
# back into remaining_lease as a string to stay consistent with the previous year-only format
dfs['remaining_lease'] = dfs.apply(lambda row: (f"{datetime.datetime.now().year - row['lease_commence_date']}"
                                                if pd.isna(row['remaining_lease']) else row['remaining_lease']), axis=1)
dfs['remaining_lease'].apply(lambda x: pd.isna(x)).value_counts()
remaining_lease
False    264466
Name: count, dtype: int64
print(f'Percentage null values {(52203 / (212263 + 52203)) * 100:.2f}%')
Percentage null values 19.74%
Imputing missing values with a column average could bias the dataset against the other rows (e.g. the average price of a town such as Orchard would be unlikely to match the average for Tuas, had there been any residential lots there). For that reason we avoided statistical imputation and instead filled remaining_lease deterministically from lease_commence_date as above, keeping all affected rows.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8426774/
APA citation: Lee, J. H., & Huber, J. C., Jr. (2021). Evaluation of multiple imputation with large proportions of missing data: How much is too much? Iranian Journal of Public Health, 50(7). https://doi.org/10.18502/ijph.v50i7.6626
The research applied multiple imputation (MI) to datasets with 20% and 80% missing values and found that higher proportions of missingness create a greater need for MI, because MI produced less biased estimates under all missing-data mechanisms. Since our proportion of missing values (19.74%) is still within the acceptable threshold, the simple deterministic fill for remaining_lease is retained.
In addition, the following were discovered:
- resale_price values all end with ,000, so they can be expressed in thousands for simplification (i.e. resale_price / 1000)
- The block information is irrelevant to the analysis because we already have town to analyse on, so we can drop this column
- The street_name column is likewise irrelevant, for the same reason as block, so we can drop it too
# This part of the code takes care of the remaining_lease column and replaces it with a column called lease_left
pattern = re.compile(r"(?P<year>\d+)(?:\ (?P<month>\d+) months)?")
def lease_to_float(remaining_lease):
    r"""
    args:
        remaining_lease (str): form of "\d+ years \d+ months"
    returns:
        list: [year match (str), month match (str, or 0 if absent),
               float representation of the year with the month
               expressed as a decimal fraction, hence sortable]
    """
    if pd.isna(remaining_lease):
        return None, None, None
    remaining_lease = str(remaining_lease)
    matches = pattern.findall(remaining_lease)
    matches = [i[0] for i in matches]
    if len(matches) < 2:
        matches.append(0)
    return matches + [float(matches[0]) + (float(matches[1]) / 12)]
dfs = dfs.reset_index(drop=True)
# Apply the cleaning function to the data
dfs[['year', 'months', 'lease_left']] = pd.DataFrame(dfs['remaining_lease'].apply(lease_to_float).to_list(), index=dfs.index)
# Check if any data loss during the cleaning process
dfs[['lease_left', 'remaining_lease']].info()
# Create a column that is casted to datetime
dfs['month_dt'] = pd.to_datetime(dfs['month'], format='%Y-%m')
# drop the original column and save the cleaned data
dfs = dfs.drop(columns=['remaining_lease', 'month'])
dfs.to_csv('2012-3--2024-3.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264466 entries, 0 to 264465
Data columns (total 2 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   lease_left       264466 non-null  float64
 1   remaining_lease  264466 non-null  object
dtypes: float64(1), object(1)
memory usage: 4.0+ MB
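Note that when the saved CSV is reloaded in a later session, month_dt comes back as plain strings unless parsed explicitly. A small sketch using an in-memory stand-in for the file written above:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv('2012-3--2024-3.csv', parse_dates=['month_dt'])
csv_text = "town,month_dt,resale_price_thousand\nYISHUN,2012-06-01,350.0\n"
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['month_dt'])
print(reloaded.dtypes['month_dt'])  # datetime64[ns]
```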
# The code below drops the block and street_name columns and rescales resale_price into thousands
dfs['resale_price_thousand'] = dfs['resale_price'] / 1000
dfs = dfs.drop(columns=['block', 'street_name'])
# Check the current data format
dfs.sample(n=5)
| | town | flat_type | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | year | months | lease_left | month_dt | resale_price_thousand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 233014 | SENGKANG | 4 ROOM | 01 TO 03 | 94.0 | Model A | 2018 | 545000.0 | 94 | 07 | 94.583333 | 2023-01-01 | 545.0 |
| 193274 | GEYLANG | 4 ROOM | 13 TO 15 | 85.0 | Model A | 2006 | 670000.0 | 84 | 05 | 84.416667 | 2021-07-01 | 670.0 |
| 72342 | CHOA CHU KANG | 4 ROOM | 07 TO 09 | 108.0 | Model A | 1995 | 385000.0 | 78 | 0 | 78.000000 | 2016-02-01 | 385.0 |
| 107972 | JURONG WEST | 5 ROOM | 04 TO 06 | 110.0 | Improved | 2001 | 505000.0 | 82 | 10 | 82.833333 | 2017-11-01 | 505.0 |
| 7522 | YISHUN | 4 ROOM | 07 TO 09 | 84.0 | Simplified | 1988 | 350000.0 | 36 | 0 | 36.000000 | 2012-06-01 | 350.0 |
dfs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264466 entries, 0 to 264465
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   town                   264466 non-null  object
 1   flat_type              264466 non-null  object
 2   storey_range           264466 non-null  object
 3   floor_area_sqm         264466 non-null  float64
 4   flat_model             264466 non-null  object
 5   lease_commence_date    264466 non-null  int64
 6   resale_price           264466 non-null  float64
 7   year                   264466 non-null  object
 8   months                 264466 non-null  object
 9   lease_left             264466 non-null  float64
 10  month_dt               264466 non-null  datetime64[ns]
 11  resale_price_thousand  264466 non-null  float64
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)
memory usage: 24.2+ MB
dfs.describe()
| | floor_area_sqm | lease_commence_date | resale_price | lease_left | month_dt | resale_price_thousand |
|---|---|---|---|---|---|---|
| count | 264466.000000 | 264466.000000 | 2.644660e+05 | 264466.000000 | 264466 | 264466.000000 |
| mean | 96.992172 | 1993.971195 | 4.797057e+05 | 66.622245 | 2018-08-27 11:57:46.381312 | 479.705659 |
| min | 31.000000 | 1966.000000 | 1.000000e+05 | 12.000000 | 2012-03-01 00:00:00 | 100.000000 |
| 25% | 76.000000 | 1984.000000 | 3.638880e+05 | 55.083333 | 2015-10-01 00:00:00 | 363.888000 |
| 50% | 94.000000 | 1993.000000 | 4.500000e+05 | 68.000000 | 2019-01-01 00:00:00 | 450.000000 |
| 75% | 112.000000 | 2003.000000 | 5.633750e+05 | 82.000000 | 2021-09-01 00:00:00 | 563.375000 |
| max | 280.000000 | 2022.000000 | 1.568888e+06 | 97.750000 | 2024-03-01 00:00:00 | 1568.888000 |
| std | 24.243036 | 13.202825 | 1.594134e+05 | 20.468383 | NaN | 159.413352 |
Task 4: Exploratory data analysis and visualization¶
Understanding the data¶
To understand the data, we start with summary statistics and count plots of the categorical variables:
#summary statistics of categorical data
print(dfs[['town','flat_type','flat_model']].describe())
              town flat_type flat_model
count       264466    264466     264466
unique          26         7         21
top       SENGKANG    4 ROOM    Model A
freq         20288    109186      86288
# Plotting some visualisations: three count plots side by side in one figure
plt.figure(figsize=(20, 5))
plt.subplot(1, 3, 1)
sns.countplot(x='town', data=dfs)
plt.xticks(rotation=90)
plt.title('Distribution of towns')
plt.subplot(1, 3, 2)
sns.countplot(x='flat_type', data=dfs)
plt.xticks(rotation=90)
plt.title('Distribution of flat types')
plt.subplot(1, 3, 3)
sns.countplot(x='flat_model', data=dfs)
plt.xticks(rotation=90)
plt.title('Distribution of flat models')
plt.tight_layout()
plt.show()
cross_tab = pd.crosstab(dfs['flat_type'], dfs['flat_model'])
print(cross_tab)
flat_model        2-room  3Gen  Adjoined flat  Apartment  DBSS  Improved  \
flat_type
1 ROOM                 0     0              0          0     0       119
2 ROOM                85     0              0          0     1       832
3 ROOM                 0     0              0          0   430     17026
4 ROOM                 0     0             50          0  1259      3113
5 ROOM                 0    28            235          0  1555     45311
EXECUTIVE              0     0            157      10122     0         0
MULTI-GENERATION       0     0              0          0     0         0

flat_model        Improved-Maisonette  Maisonette  Model A  \
flat_type
1 ROOM                              0           0        0
2 ROOM                              0           0     2008
3 ROOM                              0           0    16370
4 ROOM                              0           0    64504
5 ROOM                             35           0     3406
EXECUTIVE                           0        7475        0
MULTI-GENERATION                    0           0        0

flat_model        Model A-Maisonette  ...  Multi Generation  New Generation  \
flat_type                             ...
1 ROOM                             0  ...                 0               0
2 ROOM                             0  ...                 0               0
3 ROOM                             0  ...                 0           24406
4 ROOM                             0  ...                 0           13312
5 ROOM                           459  ...                 0               0
EXECUTIVE                          0  ...                 0               0
MULTI-GENERATION                   0  ...               103               0

flat_model        Premium Apartment  Premium Apartment Loft  \
flat_type
1 ROOM                            0                       0
2 ROOM                           57                       0
3 ROOM                         1115                       0
4 ROOM                        14342                      81
5 ROOM                        10057                      26
EXECUTIVE                      2342                       0
MULTI-GENERATION                  0                       0

flat_model        Premium Maisonette  Simplified  Standard  Terrace  Type S1  \
flat_type
1 ROOM                             0           0         0        0        0
2 ROOM                             0           0       989        0        0
3 ROOM                             0        3122      4146      139        0
4 ROOM                             0        8590       108       12      431
5 ROOM                             0           0      2889        0        0
EXECUTIVE                         20           0         0        0        0
MULTI-GENERATION                   0           0         0        0        0

flat_model        Type S2
flat_type
1 ROOM                  0
2 ROOM                  0
3 ROOM                  0
4 ROOM                  0
5 ROOM                215
EXECUTIVE               0
MULTI-GENERATION        0

[7 rows x 21 columns]
plt.figure(figsize=(20, 6))
sns.heatmap(cross_tab, annot=True, cmap='YlGnBu', fmt='d')
plt.title('Cross-tabulation of Flat Type and Flat Model')
plt.xlabel('Flat Model')
plt.ylabel('Flat Type')
plt.show()
from scipy.stats import chi2_contingency
chi2, p, _, _ = chi2_contingency(cross_tab)
print(f"Chi-square statistic: {chi2}, p-value: {p}")
Chi-square statistic: 683545.8999649041, p-value: 0.0
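The chi-square statistic grows with sample size, so a normalized effect size such as Cramér's V (our addition, not part of the original analysis) is easier to interpret; with the printed values above, V = sqrt(683545.9 / (264466 × (7 − 1))) ≈ 0.66, a strong association between flat_type and flat_model. A sketch on a toy contingency table:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy 2x2 table standing in for cross_tab
toy = pd.DataFrame([[50, 10], [5, 60]])
chi2, p, _, _ = chi2_contingency(toy)
n = toy.to_numpy().sum()
k = min(toy.shape)  # smaller table dimension
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.3f}")  # 0 = no association, 1 = perfect
```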
Floor-size and Price Trend Analysis:¶
- Check the distributions of floor size and resale prices
- Analyze how resale prices change over time across flat-size (floor area) bands
plt.figure(figsize=(120, 40)) # Adjust the figure size to accommodate both plots
# Plot for floor area
plt.subplot(1, 2, 1) # 1 row, 2 columns, plot number 1
plt.hist(dfs['floor_area_sqm'], bins=50, edgecolor='black') # Adjust the number of bins as needed
plt.title('Distribution of Floor Sizes', fontsize=100)
plt.xlabel('Floor Size (sqm)', fontsize=80) # Add unit to the x-axis label
plt.ylabel('Frequency', fontsize=80)
plt.xticks(range(0, int(max(dfs['floor_area_sqm']))+10, 20)) # Set x-axis ticks with a step size of 20
plt.tick_params(axis='both', which='major', labelsize=80) # Increase the size of the numbers on the axes
# Print a table with detailed statistics about floor area in the terminal
floor_area_stats = dfs['floor_area_sqm'].describe()
print("Floor Area Statistics:\n", floor_area_stats)
# Plot for resale price
plt.subplot(1, 2, 2) # 1 row, 2 columns, plot number 2
plt.hist(dfs['resale_price_thousand'], bins=100, edgecolor='black') # Adjust the number of bins as needed
plt.title('Distribution of Resale Prices', fontsize=100)
plt.xlabel('Resale Price (thousand $)', fontsize=80) # Add unit to the x-axis label
plt.ylabel('Frequency', fontsize=80)
plt.xticks(range(0, int(max(dfs['resale_price_thousand']))+100, 100)) # Set x-axis ticks with a step size of 100
plt.tick_params(axis='both', which='major', labelsize=80) # Increase the size of the numbers on the axes
# Print a table with detailed statistics about resale price in the terminal
resale_price_stats = dfs['resale_price_thousand'].describe()
print("Resale Price Statistics:\n", resale_price_stats)
plt.show()
Floor Area Statistics:
count    264466.000000
mean         96.992172
std          24.243036
min          31.000000
25%          76.000000
50%          94.000000
75%         112.000000
max         280.000000
Name: floor_area_sqm, dtype: float64
Resale Price Statistics:
count    264466.000000
mean        479.705659
std         159.413352
min         100.000000
25%         363.888000
50%         450.000000
75%         563.375000
max        1568.888000
Name: resale_price_thousand, dtype: float64
Below, we continue by plotting price changes over time, separating the data into bins by floor area to observe how prices in the different size bands changed over time:
dfs.sort_values('month_dt', inplace=True)
# Define bins for different ranges of sizes
bins = [0, 76, 94, 112, 280] # Adjust these values as needed
# Create a new column for the size range
dfs['size_range'] = pd.cut(dfs['floor_area_sqm'], bins)
# Group by size range and month, then calculate the mean resale price
grouped = dfs.groupby(['size_range', 'month_dt'], observed=True)['resale_price_thousand'].mean().reset_index()
# Plot the mean resale price over time for each size range
for size_range in grouped['size_range'].unique():
subset = grouped[grouped['size_range'] == size_range]
plt.plot(subset['month_dt'], subset['resale_price_thousand'], label=str(size_range))
plt.title('Trend of Prices Over Time by Size Range')
plt.xlabel('Date')
plt.ylabel('Mean Price')
plt.legend(title='Size Range')
plt.show()
From these plots, we can observe the following:
- The real estate market has seen an overall increase in prices from 2012 to 2024, with larger flats experiencing a more substantial rise. This suggests a sustained demand for larger properties, or a market perception of greater value in larger units.
- Investors might consider focusing on larger properties for potentially higher returns.
Town effect on resale price over time¶
# Create a list of unique towns
towns = dfs['town'].unique()
# Create a scatter plot of resale prices over time for each town
plt.figure(figsize=(15, 10))
for town in towns:
town_data = dfs[dfs['town'] == town]
plt.scatter(town_data['month_dt'], town_data['resale_price_thousand'], label=town)
plt.title('Resale Price by Town Over Time')
plt.xlabel('Time')
plt.ylabel('Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Place the legend outside the plot
plt.show()
# Grouping town data by month and taking the median to plot
town_medians = dfs.groupby(['town', 'month_dt'], as_index=False)[['resale_price_thousand']].median()
# Create a line plot of median resale prices over time for each town
plt.figure(figsize=(15, 10))
for town in towns:
town_data = town_medians[town_medians['town'] == town]
plt.plot(town_data['month_dt'], town_data['resale_price_thousand'], label=town)
plt.title('Median Resale Price by Town Over Time')
plt.xlabel('Time')
plt.ylabel('Median Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Place the legend outside the plot
plt.show()
The following can be observed:
- The provided visualizations reveal a market with increasing median housing prices over a 12-year span, marked by variability both within and across different towns.
- The data suggests that certain areas may be experiencing greater price appreciation, possibly reflecting their desirability or other positive attributes.
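To make "certain areas may be experiencing greater price appreciation" concrete, each town's first- and last-observed median can be compared. A sketch on a toy frame with made-up prices (the same groupby applies to the real town_medians):

```python
import pandas as pd

# Toy stand-in with one early and one late observation per town
toy = pd.DataFrame({
    'town': ['BISHAN', 'BISHAN', 'YISHUN', 'YISHUN'],
    'month_dt': pd.to_datetime(['2012-03-01', '2024-03-01',
                                '2012-03-01', '2024-03-01']),
    'resale_price_thousand': [500.0, 900.0, 300.0, 450.0],
})
med = (toy.sort_values('month_dt')
          .groupby('town')['resale_price_thousand']
          .agg(['first', 'last']))
med['growth_pct'] = (med['last'] / med['first'] - 1) * 100
print(med.sort_values('growth_pct', ascending=False))
```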
Investigating flat_type's effect on resale price¶
# Get unique flat types
flat_types = dfs['flat_type'].unique()
# Create a scatter plot of resale prices over time for each flat type
plt.figure(figsize=(15, 10))
for flat_type in flat_types:
flat_type_data = dfs[dfs['flat_type'] == flat_type]
plt.scatter(flat_type_data['month_dt'], flat_type_data['resale_price_thousand'], label=flat_type)
plt.title('Resale Price by Flat Type Over Time')
plt.xlabel('Time')
plt.ylabel('Resale Price (thousand $)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Place the legend outside the plot
plt.show()
From the above, we can observe:
- There is a clear stratification of resale prices by flat type, with larger flats generally selling for more than smaller ones
- The data indicates a general increase in resale prices over the observed period for all flat types
- Despite the general increase, there is considerable variability within each flat type, implying that other factors also play a significant role in determining resale prices
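The within-type variability can be quantified with a relative spread per group. A sketch on toy values (the same aggregation works on the full dfs):

```python
import pandas as pd

# Toy stand-in for two flat types with different price spreads
toy = pd.DataFrame({
    'flat_type': ['3 ROOM', '3 ROOM', '4 ROOM', '4 ROOM'],
    'resale_price_thousand': [300.0, 400.0, 450.0, 650.0],
})
# Coefficient of variation: standard deviation relative to the mean
spread = toy.groupby('flat_type')['resale_price_thousand'].agg(['mean', 'std'])
spread['cv'] = spread['std'] / spread['mean']
print(spread)
```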
We can also view the distributions of flat_model and storey_range in our dataset:
# Count and plot the number of occurrences of each unique value in 'flat_model'
flat_model_counts = dfs['flat_model'].value_counts()
flat_model_counts.plot(kind='bar', figsize=(10, 5))
plt.title('Counts of Flat Model')
plt.xlabel('Flat Model')
plt.ylabel('Count')
plt.show()
# Count and plot the number of occurrences of each unique value in 'storey_range'
story_range_counts = dfs['storey_range'].value_counts()
story_range_counts.plot(kind='bar', figsize=(10, 5))
plt.title('Counts of Storey Range')
plt.xlabel('Storey Range')
plt.ylabel('Count')
plt.show()
For a general overview, the following is an interactive visualisation of the distribution and evolution of housing prices over time:
%%html
<div class='tableauPlaceholder' id='viz1710941321740' style='position: relative'><noscript><a href='#'><img alt='Dashboard 1 ' src='https://public.tableau.com/static/images/Si/SingaporeAverageResalePrice/Dashboard1/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='SingaporeAverageResalePrice/Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Si/SingaporeAverageResalePrice/Dashboard1/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-GB' /><param name='filter' value='publish=yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1710941321740'); var vizElement = divElement.getElementsByTagName('object')[0]; if ( divElement.offsetWidth > 800 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='1027px';} var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';vizElement.parentNode.insertBefore(scriptElement, vizElement);</script>
ax = sns.lineplot(dfs, x='month_dt', y='resale_price', errorbar=('ci', 95), estimator='median')
ax.set_title('Median resale price per month')
plt.show()
fig = make_subplots(rows=1, cols=2, subplot_titles=[
'Distribution of resale price',
'Distribution of floor_area_sqm'
])
trace0 = go.Histogram(
x=dfs['resale_price'],
xbins={'size': 10000})
trace1 = go.Histogram(
x=dfs['floor_area_sqm'],
xbins={'size': 10}
)
fig.add_trace(trace0, row=1, col=1)
fig.add_trace(trace1, row=1, col=2)
fig.update_layout(
autosize=False,
width=1000,
height=500,
)
# Define x axis labels
fig.update_xaxes(title_text='Bins')
# Define y axis labels
fig.update_yaxes(title_text="Counts")
fig.show()
def plot_categorical_variables(df: pd.DataFrame, col: str) -> None:
    """Takes a dataframe and plots the counts for a categorical variable
    args:
        df (pd.DataFrame): Dataframe to load data from
        col (str): Name of column to plot
    return:
        None
    """
    counts = df[col].value_counts()
    # Columnar text counts for reference
    print(f'Count of {col}')
    print(counts, end='')
    # Generate the bar chart
    counts = counts.reset_index()
    fig = px.bar(
        counts,
        x=col,
        y='count',
        title=f'Counts of {col}',
        labels={
            'count': f'Count of {col}'
        }
    )
    fig.show()
# Define and plot the categorical variables reference
categorical_variables = [
'flat_model',
'flat_type',
'town',
]
for variable in categorical_variables:
plot_categorical_variables(dfs, variable)
Count of flat_model
flat_model
Model A                   86288
Improved                  66401
New Generation            37718
Premium Apartment         27913
Simplified                11712
Apartment                 10122
Standard                   8132
Maisonette                 7475
Model A2                   3384
DBSS                       3245
Model A-Maisonette          459
Adjoined flat               442
Type S1                     431
Type S2                     215
Terrace                     151
Premium Apartment Loft      107
Multi Generation            103
2-room                       85
Improved-Maisonette          35
3Gen                         28
Premium Maisonette           20
Name: count, dtype: int64
Count of flat_type
flat_type
4 ROOM              109186
3 ROOM               66754
5 ROOM               64216
EXECUTIVE            20116
2 ROOM                3972
1 ROOM                 119
MULTI-GENERATION       103
Name: count, dtype: int64
Count of town
town
SENGKANG           20288
WOODLANDS          19513
JURONG WEST        18834
TAMPINES           18159
YISHUN             17754
PUNGGOL            16022
BEDOK              15187
HOUGANG            13387
CHOA CHU KANG      11791
ANG MO KIO         11644
BUKIT MERAH        10212
BUKIT BATOK        10182
BUKIT PANJANG       9612
TOA PAYOH           8312
PASIR RIS           8198
KALLANG/WHAMPOA     7912
QUEENSTOWN          7296
SEMBAWANG           6985
GEYLANG             6851
CLEMENTI            6194
JURONG EAST         5655
SERANGOON           5168
BISHAN              4818
CENTRAL AREA        2148
MARINE PARADE       1713
BUKIT TIMAH          631
Name: count, dtype: int64
Task 5: Modeling and visualization¶
# With this, we get the median resale price per month for each flat_type in each town
grouped_town_flat = dfs.groupby(['town', 'month_dt', 'flat_type'], as_index=False)[['resale_price_thousand']].median()
# Then we sort by town, flat_type, and (most importantly) price, and take the first and last entries,
# representing the lowest and highest monthly median price for each town / flat_type pair
grouped_town_flat_month_median = grouped_town_flat.sort_values(['town', 'flat_type', 'resale_price_thousand'])\
.groupby(['town', 'flat_type'])\
.agg(['first', 'last'])
grouped_town_flat_month_median
| town | flat_type | month_dt (first) | month_dt (last) | resale_price_thousand (first) | resale_price_thousand (last) |
|---|---|---|---|---|---|
| ANG MO KIO | 2 ROOM | 2019-09-01 | 2023-02-01 | 178.0 | 310.944 |
| ANG MO KIO | 3 ROOM | 2019-05-01 | 2024-02-01 | 265.0 | 403.000 |
| ANG MO KIO | 4 ROOM | 2020-05-01 | 2023-01-01 | 380.0 | 735.000 |
| ANG MO KIO | 5 ROOM | 2020-04-01 | 2024-01-01 | 525.0 | 992.944 |
| ANG MO KIO | EXECUTIVE | 2018-01-01 | 2023-09-01 | 660.0 | 1180.000 |
| ... | ... | ... | ... | ... | ... |
| YISHUN | 3 ROOM | 2020-05-01 | 2024-02-01 | 250.0 | 409.888 |
| YISHUN | 4 ROOM | 2018-02-01 | 2024-03-01 | 325.0 | 542.000 |
| YISHUN | 5 ROOM | 2018-02-01 | 2023-12-01 | 410.0 | 678.000 |
| YISHUN | EXECUTIVE | 2017-10-01 | 2022-10-01 | 499.0 | 875.400 |
| YISHUN | MULTI-GENERATION | 2016-09-01 | 2023-01-01 | 650.0 | 1080.000 |
131 rows × 4 columns
# Let's rename the flattened columns
grouped_town_flat_month_median.reset_index(inplace=True)
grouped_town_flat_month_median.columns = ['town',
'flat_type',
'month_min',
'month_max',
'resale_price_thousand_min',
'resale_price_thousand_max']
grouped_town_flat_month_median
| | town | flat_type | month_min | month_max | resale_price_thousand_min | resale_price_thousand_max |
|---|---|---|---|---|---|---|
| 0 | ANG MO KIO | 2 ROOM | 2019-09-01 | 2023-02-01 | 178.0 | 310.944 |
| 1 | ANG MO KIO | 3 ROOM | 2019-05-01 | 2024-02-01 | 265.0 | 403.000 |
| 2 | ANG MO KIO | 4 ROOM | 2020-05-01 | 2023-01-01 | 380.0 | 735.000 |
| 3 | ANG MO KIO | 5 ROOM | 2020-04-01 | 2024-01-01 | 525.0 | 992.944 |
| 4 | ANG MO KIO | EXECUTIVE | 2018-01-01 | 2023-09-01 | 660.0 | 1180.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 126 | YISHUN | 3 ROOM | 2020-05-01 | 2024-02-01 | 250.0 | 409.888 |
| 127 | YISHUN | 4 ROOM | 2018-02-01 | 2024-03-01 | 325.0 | 542.000 |
| 128 | YISHUN | 5 ROOM | 2018-02-01 | 2023-12-01 | 410.0 | 678.000 |
| 129 | YISHUN | EXECUTIVE | 2017-10-01 | 2022-10-01 | 499.0 | 875.400 |
| 130 | YISHUN | MULTI-GENERATION | 2016-09-01 | 2023-01-01 | 650.0 | 1080.000 |
131 rows × 6 columns
From this table, we can calculate Return on Investment (ROI), defined as:
$\text{ROI} = \frac{\text{Final value of investment} - \text{Initial value of investment}}{\text{Cost of investment}} \times 100\%$
In this case, since we are finding the highest hypothetical return (someone who had the opportunity to buy at the low and sell at the high), this becomes:
$\text{ROI} = \frac{\text{resale\_price\_thousand\_max - resale\_price\_thousand\_min}}{\text{resale\_price\_thousand\_min}} \times 100\%$
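To make the formula concrete, here is the arithmetic for one row that appears in the top-50 table below (BUKIT MERAH 2 ROOM: median low of \$170k, median high of \$523k):

```python
# ROI for BUKIT MERAH 2 ROOM: bought at the median low of $170k,
# sold at the median high of $523k
roi = (523.0 - 170.0) / 170.0 * 100
print(f'{roi:.1f}%')  # 207.6%
```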
# Calculating ROI
grouped_town_flat_month_median['median_roi'] = ((grouped_town_flat_month_median['resale_price_thousand_max'] - \
grouped_town_flat_month_median['resale_price_thousand_min']) / \
grouped_town_flat_month_median['resale_price_thousand_min']) * 100
top_50 = grouped_town_flat_month_median.sort_values(by=['median_roi'], ascending=False).head(50)
top_50
| | town | flat_type | month_min | month_max | resale_price_thousand_min | resale_price_thousand_max | median_roi |
|---|---|---|---|---|---|---|---|
| 21 | BUKIT MERAH | 2 ROOM | 2020-05-01 | 2024-02-01 | 170.000 | 523.000 | 207.647059 |
| 37 | CENTRAL AREA | 4 ROOM | 2014-08-01 | 2023-08-01 | 420.000 | 1200.000 | 185.714286 |
| 80 | PASIR RIS | 3 ROOM | 2019-10-01 | 2023-12-01 | 215.000 | 610.000 | 183.720930 |
| 38 | CENTRAL AREA | 5 ROOM | 2012-03-01 | 2023-10-01 | 575.500 | 1438.888 | 150.023979 |
| 45 | CLEMENTI | 2 ROOM | 2019-06-01 | 2023-05-01 | 205.000 | 492.000 | 140.000000 |
| 89 | QUEENSTOWN | 2 ROOM | 2018-11-01 | 2023-12-01 | 185.000 | 431.500 | 133.243243 |
| 48 | CLEMENTI | 5 ROOM | 2018-01-01 | 2024-01-01 | 526.000 | 1180.000 | 124.334601 |
| 35 | CENTRAL AREA | 2 ROOM | 2020-06-01 | 2023-06-01 | 200.000 | 430.000 | 115.000000 |
| 47 | CLEMENTI | 4 ROOM | 2019-04-01 | 2024-02-01 | 418.000 | 870.000 | 108.133971 |
| 52 | GEYLANG | 4 ROOM | 2014-08-01 | 2023-10-01 | 377.000 | 770.000 | 104.244032 |
| 64 | JURONG EAST | EXECUTIVE | 2017-01-01 | 2023-09-01 | 500.000 | 1020.000 | 104.000000 |
| 117 | TOA PAYOH | 4 ROOM | 2015-04-01 | 2022-10-01 | 415.000 | 844.444 | 103.480482 |
| 10 | BISHAN | 3 ROOM | 2016-01-01 | 2023-08-01 | 260.000 | 520.000 | 100.000000 |
| 5 | BEDOK | 2 ROOM | 2019-12-01 | 2023-07-01 | 180.000 | 355.000 | 97.222222 |
| 120 | WOODLANDS | 2 ROOM | 2018-11-01 | 2024-02-01 | 180.000 | 355.000 | 97.222222 |
| 2 | ANG MO KIO | 4 ROOM | 2020-05-01 | 2023-01-01 | 380.000 | 735.000 | 93.421053 |
| 121 | WOODLANDS | 3 ROOM | 2020-04-01 | 2024-02-01 | 223.000 | 430.000 | 92.825112 |
| 50 | GEYLANG | 2 ROOM | 2019-10-01 | 2023-10-01 | 150.000 | 287.944 | 91.962667 |
| 107 | SERANGOON | 5 ROOM | 2016-03-01 | 2024-01-01 | 455.888 | 871.000 | 91.055698 |
| 22 | BUKIT MERAH | 3 ROOM | 2020-04-01 | 2022-02-01 | 283.500 | 540.000 | 90.476190 |
| 3 | ANG MO KIO | 5 ROOM | 2020-04-01 | 2024-01-01 | 525.000 | 992.944 | 89.132190 |
| 65 | JURONG WEST | 2 ROOM | 2018-12-01 | 2023-03-01 | 192.000 | 360.000 | 87.500000 |
| 118 | TOA PAYOH | 5 ROOM | 2020-02-01 | 2024-03-01 | 550.000 | 1026.944 | 86.717091 |
| 9 | BEDOK | EXECUTIVE | 2019-06-01 | 2024-03-01 | 590.000 | 1100.000 | 86.440678 |
| 30 | BUKIT PANJANG | EXECUTIVE | 2016-07-01 | 2023-03-01 | 450.000 | 838.000 | 86.222222 |
| 119 | TOA PAYOH | EXECUTIVE | 2019-07-01 | 2024-02-01 | 600.000 | 1114.444 | 85.740667 |
| 73 | KALLANG/WHAMPOA | 5 ROOM | 2020-05-01 | 2023-08-01 | 523.500 | 970.000 | 85.291309 |
| 18 | BUKIT BATOK | 5 ROOM | 2019-11-01 | 2022-06-01 | 470.000 | 865.000 | 84.042553 |
| 53 | GEYLANG | 5 ROOM | 2020-01-01 | 2018-07-01 | 500.000 | 920.000 | 84.000000 |
| 71 | KALLANG/WHAMPOA | 3 ROOM | 2020-04-01 | 2024-03-01 | 244.000 | 448.000 | 83.606557 |
| 55 | HOUGANG | 2 ROOM | 2019-08-01 | 2024-02-01 | 205.000 | 375.000 | 82.926829 |
| 17 | BUKIT BATOK | 4 ROOM | 2020-04-01 | 2023-06-01 | 332.000 | 603.500 | 81.777108 |
| 105 | SERANGOON | 3 ROOM | 2019-09-01 | 2023-09-01 | 255.000 | 463.000 | 81.568627 |
| 34 | BUKIT TIMAH | EXECUTIVE | 2012-03-01 | 2023-12-01 | 745.000 | 1350.000 | 81.208054 |
| 32 | BUKIT TIMAH | 4 ROOM | 2012-07-01 | 2023-11-01 | 495.000 | 890.000 | 79.797980 |
| 4 | ANG MO KIO | EXECUTIVE | 2018-01-01 | 2023-09-01 | 660.000 | 1180.000 | 78.787879 |
| 108 | SERANGOON | EXECUTIVE | 2015-02-01 | 2022-10-01 | 607.000 | 1085.000 | 78.747941 |
| 19 | BUKIT BATOK | EXECUTIVE | 2015-01-01 | 2023-05-01 | 535.000 | 954.000 | 78.317757 |
| 96 | SEMBAWANG | 4 ROOM | 2020-04-01 | 2023-11-01 | 320.000 | 570.000 | 78.125000 |
| 97 | SEMBAWANG | 5 ROOM | 2018-03-01 | 2023-12-01 | 370.000 | 650.000 | 75.675676 |
| 129 | YISHUN | EXECUTIVE | 2017-10-01 | 2022-10-01 | 499.000 | 875.400 | 75.430862 |
| 98 | SEMBAWANG | EXECUTIVE | 2018-03-01 | 2023-07-01 | 448.888 | 786.000 | 75.099357 |
| 74 | KALLANG/WHAMPOA | EXECUTIVE | 2019-11-01 | 2023-03-01 | 655.000 | 1146.000 | 74.961832 |
| 72 | KALLANG/WHAMPOA | 4 ROOM | 2014-10-01 | 2022-02-01 | 480.000 | 839.000 | 74.791667 |
| 0 | ANG MO KIO | 2 ROOM | 2019-09-01 | 2023-02-01 | 178.000 | 310.944 | 74.687640 |
| 87 | PUNGGOL | 5 ROOM | 2018-06-01 | 2024-01-01 | 424.000 | 740.000 | 74.528302 |
| 114 | TAMPINES | MULTI-GENERATION | 2012-06-01 | 2024-03-01 | 610.000 | 1060.000 | 73.770492 |
| 124 | WOODLANDS | EXECUTIVE | 2016-09-01 | 2024-03-01 | 530.000 | 915.000 | 72.641509 |
| 41 | CHOA CHU KANG | 3 ROOM | 2017-03-01 | 2024-03-01 | 250.000 | 430.000 | 72.000000 |
| 13 | BISHAN | EXECUTIVE | 2013-09-01 | 2024-02-01 | 755.000 | 1295.000 | 71.523179 |
print(top_50['flat_type'].value_counts())
print()
print(top_50['town'].value_counts())
flat_type
EXECUTIVE           13
2 ROOM              10
5 ROOM              10
4 ROOM               9
3 ROOM               7
MULTI-GENERATION     1
Name: count, dtype: int64

town
ANG MO KIO         4
KALLANG/WHAMPOA    4
SERANGOON          3
SEMBAWANG          3
CLEMENTI           3
GEYLANG            3
BUKIT BATOK        3
TOA PAYOH          3
CENTRAL AREA       3
WOODLANDS          3
BUKIT TIMAH        2
BUKIT MERAH        2
BEDOK              2
BISHAN             2
JURONG WEST        1
BUKIT PANJANG      1
JURONG EAST        1
HOUGANG            1
QUEENSTOWN         1
PASIR RIS          1
YISHUN             1
PUNGGOL            1
TAMPINES           1
CHOA CHU KANG      1
Name: count, dtype: int64
From the above, within the top 50 median-price ROIs over time, we can observe:
- ANG MO KIO and KALLANG/WHAMPOA are among the strongest towns for investing, as each has four flat types with a high ROI.
- EXECUTIVE flats are the most represented flat type, appearing 13 times in the top 50 and spanning many different towns.
import datetime

import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)
# Convert the parsed month to a Unix timestamp so the model can treat time numerically
dfs['month_timestamp'] = dfs['month_dt'].apply(lambda x: x.timestamp())
train_cols = [
'month_timestamp',
'town',
'flat_type',
'storey_range',
'floor_area_sqm',
'flat_model',
'lease_commence_date',
'lease_left'
]
X_train, X_test, y_train, y_test = train_test_split(
dfs[train_cols], dfs["resale_price"], test_size=0.25
)
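Note that `train_test_split` samples rows at random, so the test set can contain months earlier than some training months. Since Objective 2 is forecasting, one might instead consider holding out the most recent period so the evaluation is strictly "in the future". A minimal sketch on toy data (the column names mirror ours, but the values are made up):

```python
import pandas as pd

# Toy frame standing in for dfs: eight consecutive "months" with rising prices
toy = pd.DataFrame({'month_timestamp': range(8),
                    'resale_price': [300, 310, 320, 335, 350, 370, 390, 410]})

# Hold out the most recent 25% of rows so every test month is after every training month
toy = toy.sort_values('month_timestamp')
cutoff = int(len(toy) * 0.75)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]
print(len(train), len(test))  # 6 2
```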
dfs['flat_type'].drop_duplicates().sort_values().to_list()
['1 ROOM', '2 ROOM', '3 ROOM', '4 ROOM', '5 ROOM', 'EXECUTIVE', 'MULTI-GENERATION']
dfs['storey_range'].drop_duplicates().sort_values().to_list()
['01 TO 03', '01 TO 05', '04 TO 06', '06 TO 10', '07 TO 09', '10 TO 12', '11 TO 15', '13 TO 15', '16 TO 18', '16 TO 20', '19 TO 21', '21 TO 25', '22 TO 24', '25 TO 27', '26 TO 30', '28 TO 30', '31 TO 33', '31 TO 35', '34 TO 36', '36 TO 40', '37 TO 39', '40 TO 42', '43 TO 45', '46 TO 48', '49 TO 51']
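These zero-padded labels are convenient: `OrdinalEncoder` sorts string categories lexicographically by default, which here coincides with the natural low-to-high storey order. A quick check on a few of the labels:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
codes = enc.fit_transform(np.array([['10 TO 12'], ['01 TO 03'], ['04 TO 06']]))

# Lexicographic sort of the zero-padded labels matches the natural storey order,
# so the integer codes increase with storey height
print(enc.categories_[0])  # ['01 TO 03' '04 TO 06' '10 TO 12']
print(codes.ravel())       # [2. 0. 1.]
```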
# One-hot encode the nominal categoricals (town, flat_model); ordinal-encode the
# naturally ordered ones (flat_type, storey_range); standard-scale the remaining
# numeric columns
ct = make_column_transformer(
    (OneHotEncoder(), make_column_selector('town|flat_model')),
    (OrdinalEncoder(), make_column_selector('flat_type|storey_range')),
    remainder=StandardScaler())
ct.fit(X_train, y=y_train)
ColumnTransformer(remainder=StandardScaler(),
transformers=[('onehotencoder', OneHotEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x298494810>),
('ordinalencoder', OrdinalEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x298497c50>)])
model = Pipeline([
    ('Transform', ct),
    # HistGradientBoostingRegressor does not accept sparse input, so convert
    # the one-hot encoded sparse matrix to a dense array first
    ('todense', FunctionTransformer(lambda x: np.array(x.todense()), accept_sparse=True)),
    ('hgbr', HistGradientBoostingRegressor())])
model.fit(X_train, y=y_train)
Pipeline(steps=[('Transform',
ColumnTransformer(remainder=StandardScaler(),
transformers=[('onehotencoder',
OneHotEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x298494810>),
('ordinalencoder',
OrdinalEncoder(),
<sklearn.compose._column_transformer.make_column_selector object at 0x298497c50>)])),
('todense',
FunctionTransformer(accept_sparse=True,
func=<function <lambda> at 0x2d5232de0>)),
('hgbr', HistGradientBoostingRegressor())])
# R² score on the held-out test set
model.score(X_test, y_test)
0.9363627142697488
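R² alone does not convey error magnitude in dollars; mean absolute error (MAE) and RMSE are often more interpretable for stakeholders. A self-contained sketch with hypothetical prices (to apply it here, replace the arrays with `y_test` and `model.predict(X_test)`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted resale prices (SGD)
y_true = np.array([420_000, 615_000, 380_000, 510_000])
y_pred = np.array([400_000, 640_000, 395_000, 500_000])

mae = mean_absolute_error(y_true, y_pred)            # average |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalises large misses more
print(f'MAE: ${mae:,.0f} SGD, RMSE: ${rmse:,.0f} SGD')
```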
set(dfs['flat_model'])
{'2-room',
'3Gen',
'Adjoined flat',
'Apartment',
'DBSS',
'Improved',
'Improved-Maisonette',
'Maisonette',
'Model A',
'Model A-Maisonette',
'Model A2',
'Multi Generation',
'New Generation',
'Premium Apartment',
'Premium Apartment Loft',
'Premium Maisonette',
'Simplified',
'Standard',
'Terrace',
'Type S1',
'Type S2'}
print('This timestamp will be used as "now":', timenow := datetime.datetime.now().timestamp())
This timestamp will be used as "now": 1711895409.432791
prediction = model.predict(pd.DataFrame(
{
'month_timestamp': timenow,
'town': 'CLEMENTI',
'flat_type': '3 ROOM',
'storey_range': '01 TO 05',
'floor_area_sqm': 121.0,
'flat_model': 'Standard',
'lease_commence_date': 2012,
'lease_left': 87.8
},
index=[0]
))
print(f'The predicted resale price for the given parameters is: ${round(prediction[0])} SGD')
The predicted resale price for the given parameters is: $798524 SGD
Task 6: Report insights and conclusions¶
In the report, describe all the tasks above. You can describe each task at relevant places in the notebook rather than writing a long essay at the beginning/in the end. In your reports, have a section that explicitly presents your insights into the problem being solved and the conclusions you draw.